This week we are launching into RStudio and R Markdown. You should have R, RStudio, and RTools (Windows OS) installed on your machine by now. Because R Notebooks and R Markdown will be the primary platform for writing and sharing code in this class, it is a good idea for us to build familiarity with them ASAP. This exercise is whimsical, but it also introduces several of the R Markdown formatting conventions that we will come to rely on and gives you a reason to practice what you read for today’s class. To wit, this HTML file was produced by knitting the .Rmd file into the format specified in the YAML header, which will typically be an HTML file for us.
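To make the knitting step a bit more concrete, here is a minimal sketch that writes a tiny .Rmd file to a temporary location, reads back its YAML header, and knits it. The title and body text are placeholders invented for illustration, and the sketch assumes the rmarkdown package is installed (it comes with RStudio).

```r
library(rmarkdown)

# Hypothetical minimal .Rmd source; the title and body text are placeholders.
rmd_lines <- c(
  "---",
  "title: \"A Whimsical Exercise\"",
  "output: html_document",
  "---",
  "",
  "## A Second-Level Heading",
  "",
  "Plain text with *italics* and **bold**.",
  "",
  "- a bulleted item",
  "- another bulleted item"
)

# Write the source to a temporary file so nothing in your project is touched.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Read back the YAML header to see which output format knitting will use.
header <- yaml_front_matter(rmd_path)
header$output

# Knit to HTML only when pandoc is available (RStudio bundles it).
if (pandoc_available()) {
  html_path <- render(rmd_path, quiet = TRUE)
}
```

Clicking the Knit button in RStudio does essentially the same thing as the render call, using whatever output format the YAML header names.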


Some people think that I only like nerdy comics like xkcd, but they are mistaken. I also have a deep appreciation for what has come before. Take, for instance, Garfield by Jim Davis.


The table below lists the main characters, for those who may be Garfield noobs!

Name      Description
Garfield  Cat
Odie      Dog
Jon       Human
Nermal    Cat?

There are several R packages designed to help you create better-looking tables in R Markdown, and we will introduce a couple of those over the coming weeks (e.g., the kable function from the knitr package).
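As a rough preview of one such option, the sketch below rebuilds the character table as a data frame and passes it to the kable function from the knitr package. The object names and the caption are chosen here for illustration only.

```r
library(knitr)

# The character table from this exercise, stored as a data frame.
characters <- data.frame(
  Name = c("Garfield", "Odie", "Jon", "Nermal"),
  Description = c("Cat", "Dog", "Human", "Cat?")
)

# kable() turns the data frame into a formatted table when the document is knit.
character_table <- kable(characters, caption = "Main characters in Garfield")
character_table
```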

Reasons Garfield Is Awesome

There are many reasons that Garfield is a great character. Allow me to explain in bulleted list form…



Fun Facts About Garfield

There are lots of interesting things that you may not know about Garfield that you probably should know about Garfield. Here’s one… Did you know that Muncie, Indiana is the setting for the comic strip? Muncie also happens to be the home of Ball State University.


The image above is from last week, so this strip is still going strong!


From Comic Strips to Comic Books

Some of you may also be fans of comic books, and the Marvel Cinematic Universe has really become embedded in American popular culture. The data in the file marvel-wikia-data.csv were scraped in 2014 and used in a FiveThirtyEight story about gender bias in the comic book industry. I realize that we have not yet introduced the dplyr package, but you did look at the piece on the readr package today. Let’s import this comic book characters dataset and poke around a little…

# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.1
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
data_0 <- read_csv("marvel-wikia-data.csv")
## Rows: 16376 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, FIRST APPEAR...
## dbl  (3): page_id, APPEARANCES, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data_0)
## spec_tbl_df [16,376 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ page_id         : num [1:16376] 1678 7139 64786 1868 2460 ...
##  $ name            : chr [1:16376] "Spider-Man (Peter Parker)" "Captain America (Steven Rogers)" "Wolverine (James \\\"Logan\\\" Howlett)" "Iron Man (Anthony \\\"Tony\\\" Stark)" ...
##  $ urlslug         : chr [1:16376] "\\/Spider-Man_(Peter_Parker)" "\\/Captain_America_(Steven_Rogers)" "\\/Wolverine_(James_%22Logan%22_Howlett)" "\\/Iron_Man_(Anthony_%22Tony%22_Stark)" ...
##  $ ID              : chr [1:16376] "Secret Identity" "Public Identity" "Public Identity" "Public Identity" ...
##  $ ALIGN           : chr [1:16376] "Good Characters" "Good Characters" "Neutral Characters" "Good Characters" ...
##  $ EYE             : chr [1:16376] "Hazel Eyes" "Blue Eyes" "Blue Eyes" "Blue Eyes" ...
##  $ HAIR            : chr [1:16376] "Brown Hair" "White Hair" "Black Hair" "Black Hair" ...
##  $ SEX             : chr [1:16376] "Male Characters" "Male Characters" "Male Characters" "Male Characters" ...
##  $ GSM             : chr [1:16376] NA NA NA NA ...
##  $ ALIVE           : chr [1:16376] "Living Characters" "Living Characters" "Living Characters" "Living Characters" ...
##  $ APPEARANCES     : num [1:16376] 4043 3360 3061 2961 2258 ...
##  $ FIRST APPEARANCE: chr [1:16376] "Aug-62" "Mar-41" "Oct-74" "Mar-63" ...
##  $ Year            : num [1:16376] 1962 1941 1974 1963 1950 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   page_id = col_double(),
##   ..   name = col_character(),
##   ..   urlslug = col_character(),
##   ..   ID = col_character(),
##   ..   ALIGN = col_character(),
##   ..   EYE = col_character(),
##   ..   HAIR = col_character(),
##   ..   SEX = col_character(),
##   ..   GSM = col_character(),
##   ..   ALIVE = col_character(),
##   ..   APPEARANCES = col_double(),
##   ..   `FIRST APPEARANCE` = col_character(),
##   ..   Year = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
View(data_0)

The read_csv function is part of the readr package and we use it to import—you guessed it—.csv files. Note that the str function displays the structure of an object while the View function allows us to interact with the data in a separate window.
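If you want to experiment with read_csv without touching the real file, a small sketch like the following may help. The three rows are made up for illustration, and passing literal text through I() assumes readr 2.0 or later.

```r
library(readr)

# Made-up rows that mimic a few columns of the Marvel file (not real data).
csv_text <- paste(
  c(
    "name,SEX,APPEARANCES,Year",
    "Character A,Male Characters,120,1962",
    "Character B,Female Characters,85,1975",
    "Character C,,12,2011"
  ),
  collapse = "\n"
)

# I() tells read_csv to treat the string as file contents rather than a path;
# show_col_types = FALSE quiets the column specification message.
toy <- read_csv(I(csv_text), show_col_types = FALSE)

# Empty fields are read as NA; text becomes chr and numbers become dbl.
str(toy)
```

Note that View only works in an interactive session, so it is usually left out of (or commented out in) a document that will be knit.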


The reading for next time goes into further detail regarding the dplyr package, which is the workhorse for data wrangling. It is fundamental to the work we will do this semester, and so we may as well start familiarizing ourselves with it.

data <- data_0 %>%
  drop_na(SEX)

dim(data_0)
## [1] 16376    13
dim(data)
## [1] 15522    13
data %>% 
  group_by(SEX) %>%
  count()
data %>% 
  count(SEX) %>%
  mutate(percent = (n / sum(n)) * 100)

In the above code chunk, we use the tidyr::drop_na function to remove observations (i.e., rows) in the dataset that do not have a value for the SEX attribute. We then use the base R function dim, which is short for dimensions, to compare the number of rows and columns before and after we perform that operation. Here the row count falls from 16,376 to 15,522, so 854 rows were dropped, while the number of columns stays at 13.
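The difference between dropping rows based on one column and dropping rows with any missing value can be seen in a small sketch. The tibble below is invented for illustration and is not part of the Marvel data.

```r
library(tidyr)
library(tibble)

# Made-up data with missing values in two different columns.
toy <- tibble(
  name = c("A", "B", "C", "D"),
  SEX = c("Male Characters", NA, "Female Characters", "Male Characters"),
  APPEARANCES = c(10, 7, NA, 3)
)

# Drop only the rows where SEX is missing.
toy_sex <- drop_na(toy, SEX)

# With no columns named, drop_na() removes rows with a missing value anywhere.
toy_complete <- drop_na(toy)

dim(toy)
dim(toy_sex)
dim(toy_complete)
```

Naming the column, as the chunk above does with SEX, keeps rows that are only missing values in attributes you are not analyzing.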

Next, we use the pipe operator %>% to link multiple functions together. This allows us to write fewer lines of code and (arguably) makes it easier to understand what is happening! You read the code from top to bottom. The data object above has the dplyr::group_by function applied to it such that observations (i.e., rows) are grouped according to the value of the SEX attribute, then the result is passed to the dplyr::count function. Because there is no <- operator, the resulting table is displayed but it is not stored in an object that we can go back to later. By default, the count function generates a new attribute (i.e., column) called n, which can be referenced in subsequent functions that are part of the sequence and linked through the %>% operator.
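To see what the pipe is doing, it can help to compare a piped sequence with the equivalent nested call. This is a sketch on a made-up three-row tibble; the object names piped and nested are chosen here for illustration.

```r
library(dplyr)

# Made-up rows for illustration; only the SEX column matters here.
toy <- tibble(
  SEX = c("Male Characters", "Female Characters", "Male Characters")
)

# Piped version: read top to bottom.
piped <- toy %>%
  group_by(SEX) %>%
  count()

# Equivalent nested version: read from the inside out.
nested <- count(group_by(toy, SEX))

# Both approaches give the same counts in the new column n.
identical(piped$n, nested$n)

# Because we assigned with <-, the table is stored and can be reused later.
piped
```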

The final bit of the above code chunk eliminates the group_by component and instead applies the count function directly to the SEX attribute. Then, the dplyr::mutate function is used to create a new attribute (i.e., column) alongside the n attribute that contains the percentage value. Again, because there is no <- operator, the resulting table is displayed but it is not stored in an object that we can go back to later.
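As a small illustration of the count and mutate pattern, and of what changes when the result is assigned with <-, consider this sketch on invented data. The object name sex_share is hypothetical.

```r
library(dplyr)

# Made-up rows for illustration (three of one category, one of another).
toy <- tibble(
  SEX = c("Male Characters", "Male Characters",
          "Male Characters", "Female Characters")
)

# count() tallies rows per category; mutate() adds the percentage column.
# Assigning with <- stores the table instead of just printing it.
sex_share <- toy %>%
  count(SEX) %>%
  mutate(percent = (n / sum(n)) * 100)

# The stored table can now be printed or used in later steps.
sex_share
```

The percentages in such a table should always sum to 100, which makes for a quick sanity check.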

This quick analysis shows that there are far fewer female characters than male characters, but we could also ask if the number of appearances is more or less skewed.

Your Turn


Insert a new code chunk then try to modify the preceding code to determine:

  • How the percentage of appearances varies across gender designations
  • If there is more balance in the percentage of appearances if we limit the analysis to characters originating since 2010

Keeping in mind that these data are circa 2014, you can limit the dataset to characters introduced in 2010 or later like this:

data_2010_2014 <- data %>%
  filter(Year >= 2010)

Hint: you will probably want to create a standalone object that contains the total number of appearances for use as the denominator in your calculations. You can access the function reference for the dplyr package here.


If you are hungry for more dplyr, try rerunning the same code for the DC Comics dataset included with this assignment (i.e., dc-wikia-data.csv).

Why Do I Need RTools?

The last thing I want to introduce here is the rationale for installing RTools (or Xcode Command Line Tools if you have a Mac). Sometimes we want to access R packages that are not available on the official CRAN mirror sites, and usually that means downloading and compiling from a platform like GitHub. In the chunk below, we set our preferred CRAN mirror in the code, then install the devtools package, which allows us to pull packages like emo from GitHub. The emo package allows us to insert emoji into our R Notebooks, which will really increase your enjoyment and quality of life.

options(repos=c(CRAN="https://mirrors.nics.utk.edu/cran/")) 
install.packages("devtools")
## package 'devtools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\bw6xs\AppData\Local\Temp\RtmpoBVwQq\downloaded_packages
devtools::install_github("hadley/emo")
library(emo)

If we did not have RTools (or Xcode Command Line Tools if you have a Mac) installed, this part wouldn’t work and the mood would be decidedly 😢
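If you are not sure whether the build tools are set up on your machine, one way to check is sketched below. It assumes the pkgbuild package is available (it is installed as a dependency of devtools) and skips the check otherwise; the object name tools_ok is chosen here for illustration.

```r
# Assumption: the pkgbuild package is installed (devtools depends on it).
if (requireNamespace("pkgbuild", quietly = TRUE)) {
  # TRUE if a working compiler toolchain (e.g., RTools) is found.
  tools_ok <- pkgbuild::has_build_tools(debug = FALSE)
} else {
  tools_ok <- NA
}

if (isTRUE(tools_ok)) {
  message("Build tools found: packages that need compiling can be built.")
} else if (isFALSE(tools_ok)) {
  message("No build tools found: install RTools or Xcode Command Line Tools first.")
} else {
  message("pkgbuild is not installed, so the check was skipped.")
}
```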

Take a look at this page to get a sense of which keywords are associated with your fave emoji 😮 but you should know that if there are multiple emoji associated with a given keyword, the emo package randomly grabs one each time.



You have reached the end!